MEDB 5505, Module 03

2025-02-01

Topics to be covered

  • What you will learn
    • Reading text files
    • Comma delimited files
    • Tab delimited files
    • Other delimiters
    • Fixed width files
    • Real world examples
    • Your programming assignment

Text files, 1

  • Advantages
    • Easy import into many programs
    • Review using Notepad
  • Disadvantages
    • Bigger size
    • Slower to import

Text files, 2

  • Wide range of formats
    • Delimited
    • Fixed width
  • First row for variable names
    • Optional but recommended
  • Always look for a data dictionary

Should I download before reading?

  • Read directly from website
    • Convenient
    • Updates incorporated at each run
  • Download then read
    • Downloaded file doesn’t disappear
    • Avoid repeated long downloads
    • Work even when Internet connection is down
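One way to combine the two approaches is to download the file once and read the local copy on later runs. The sketch below is an illustration only: the helper name download_once and the local path in the commented example are hypothetical, and it assumes base R's download.file() can reach the site.

```r
# Hypothetical helper: download a file only if no local copy exists yet.
download_once <- function(url, local_path) {
  if (!file.exists(local_path)) {
    # mode = "wb" copies the bytes unchanged (matters on Windows)
    download.file(url, destfile = local_path, mode = "wb")
  }
  # Return the local path so it can be passed to a reading function
  local_path
}

# Example use (needs an Internet connection the first time only).
# The local path is an assumption, not a location used in this module.
# local_copy <- download_once(
#   url = "http://jse.amstat.org/datasets/airport.dat.txt",
#   local_path = "../data/airport.txt")
```

Delete the local copy whenever you want the next run to pick up an updated version from the website.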

No data dictionary?

  • Peek at file
    • Check for the same number of delimiters on each line
    • Tabs versus multiple blanks are hard to distinguish
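Counting delimiters by eye is error prone, so a few lines of base R can do the counting instead. This is a minimal sketch: the vector peek holds made-up lines standing in for the first few rows of a file, and the last line deliberately uses a blank instead of a tab.

```r
# Hypothetical lines standing in for the first few rows of a file;
# in practice they might come from readLines() or read_lines().
peek <- c("x\ty", "1\t4", "2\t8", "3 12")

# Count the tab characters on each line.
tab_count <- lengths(regmatches(peek, gregexpr("\t", peek, fixed = TRUE)))
print(tab_count)

# Lines whose count differs from the most common count deserve a look.
typical <- as.integer(names(which.max(table(tab_count))))
odd_lines <- which(tab_count != typical)
print(odd_lines)
```

Because the search is for an actual tab character, this also separates tabs from runs of blanks, which look the same on screen.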

No data dictionary?

  • Experiment
    • Read warnings carefully
  • If needed, edit the file manually
    • Simple edits of one or two offending lines
    • Global search and replace
    • Change tabs to blanks
    • Change multiple blanks to single blank
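The global edits listed above can also be scripted, which leaves the original file untouched and keeps a record of what was changed. A minimal sketch in base R, using made-up lines rather than a real file:

```r
# Hypothetical lines with a mix of tabs and runs of blanks.
messy <- c("Alpine\t14.1   0.86", "Carlton     4.1\t0.40")

# Change tabs to blanks, then squeeze runs of blanks down to one.
no_tabs <- gsub("\t", " ", messy, fixed = TRUE)
clean <- gsub(" +", " ", no_tabs)
print(clean)
```

The cleaned lines could then be written out with writeLines() and read as a file with a single blank as the delimiter.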

Troubleshooting

  • Some warning signs that it didn’t work
    • Multiple variables read in as a single variable
    • Lots of missing values
    • Bottom looks different than top
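Each of these warning signs can be checked with a line or two of code. The sketch below deliberately reads made-up tab delimited text with the wrong function so that the first warning sign shows up. It assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Hypothetical tab delimited text, deliberately read with the wrong function.
tab_text <- "x\ty\n1\t4\n2\t8\n3\t12\n"
wrong <- read_csv(I(tab_text), col_types = cols(.default = col_character()))

# Warning sign: everything landed in a single variable.
print(ncol(wrong))

# Other quick checks: missing values per column, and top versus bottom.
print(colSums(is.na(wrong)))
print(head(wrong, 3))
print(tail(wrong, 3))
```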

Break #1

  • What you have learned
    • Reading text files
  • What’s coming next
    • Comma delimited files

The readr library

  • Part of tidyverse
  • For importing text files
  • Broad range of formats
  • Very fast
  • Makes intelligent guesses

Useful functions in the readr library

  • read_csv(): comma-separated values
  • read_tsv(): tab-separated values
  • read_delim(): arbitrary delimiter
  • read_fwf(): fixed-width files
  • read_table(): whitespace-separated files

Arguments for most readr functions

  • col_names =
    • TRUE, FALSE, or a vector of names
  • col_types =
    • “n” for numeric, “c” for character, “?” for guess
  • na =
    • Defaults to the empty string and “NA” for most functions
  • skip =
    • How many rows to skip (defaults to 0)
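To see these four arguments working together, here is a small sketch. The text is made up for illustration: it has a title line to skip, no row of variable names, and a period for a missing value. It assumes readr 2.0 or later for the I() wrapper.

```r
library(readr)

# Hypothetical file contents: a title line, no header row, and "." for missing.
raw_text <- "Made-up measurements\n1,4\n2,.\n3,12\n"

d <- read_csv(
  I(raw_text),
  col_names = c("x", "y"),  # supply names because the file has none
  col_types = "nn",         # both columns numeric
  na = ".",                 # treat "." as missing
  skip = 1)                 # skip the title line

print(d)
```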

An example of a comma delimited file

x,y
1,4
2,8
3,12
4,16

The read_csv function

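library(tidyverse)  # loads readr (read_csv) and dplyr (glimpse)

# The first row holds the variable names; both columns are numeric.
simple_comma <- read_csv(
  file="../data/simple.csv",
  col_names=TRUE,
  col_types="nn")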

glimpse(simple_comma)

Live demonstration, 1

Now, you will see a live demonstration of part 1 of simon-5505-03-demo.

Break #2

  • What you have learned
    • Comma delimited files
  • What’s coming next
    • Tab delimited files

The evil tab character

  • Jumps to a specific location
    • Location varies from program to program
  • Looks like multiple blanks, but is a single character
  • Can mask hidden blanks

How to recognize a tab delimited file

  • Partially aligned columns
  • Everything is left justified

This is an example of a tab delimited file

Fat Sodium  Calories
19  920 410
31  1500    580
34  1310    590
35  860 570
39  1180    640
39  940 680
43  1260    660

This file is not tab delimited

Alpine           14.1 0.86 0.9853 13.6
Benson&Hedges    16.0 1.06 1.0938 16.6
BullDurham       29.8 2.03 1.1650 23.5
CamelLights       8.0 0.67 0.9280 10.2
Carlton           4.1 0.40 0.9462  5.4
Chesterfield     15.0 1.04 0.8885 15.0

Tab delimited? Maybe, maybe not

  9 1.7080 57.0   F     N
  8 1.7240 67.5   F     N
  7 1.7200 54.5   F     N
  9 1.5580 53.0   M     N
  9 1.8950 57.0   M     N
  8 2.3360 61.0   F     N

A simple tab delimited file

x   y
1   4
2   8
3   12
4   16

Using the read_tsv function

simple_tab <- read_tsv(
  file="../data/simple.tsv",
  col_names=TRUE,
  col_types="nn")

glimpse(simple_tab)

Live demonstration, 2

Now, you will see a live demonstration of part 2 of simon-5505-03-demo.

Break #3

  • What you have learned
    • Tab delimited files
  • What’s coming next
    • Other delimiters

Anything can be a delimiter

x~y
1~4
2~8
3~12
4~16

Using the read_delim function

simple_tilde <- read_delim(
  file="../data/tilde.txt",
  delim="~",
  col_names=TRUE,
  col_types="nn")

glimpse(simple_tilde)

Live demonstration, 3

Now, you will see a live demonstration of part 3 of simon-5505-03-demo.

Break #4

  • What you have learned
    • Other delimiters
  • What’s coming next
    • Fixed width files

Reading fixed width format files

1 4
2 8
312
416

Disadvantages of fixed width formatting?

  • Confusing
    • What is 312?
      • 3, 1, and 2?
      • 31 and 2?
      • 3 and 12?
      • 312?
  • More work
  • Prone to errors
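The ambiguity is easy to demonstrate. In this base R sketch, the same three characters produce different numbers depending on which column widths you assume.

```r
# The ambiguous line from the fixed width example.
line <- "312"

# Reading 1: widths 1 and 2 give 3 and 12.
split_1_2 <- as.numeric(c(substr(line, 1, 1), substr(line, 2, 3)))
print(split_1_2)

# Reading 2: widths 2 and 1 give 31 and 2.
split_2_1 <- as.numeric(c(substr(line, 1, 2), substr(line, 3, 3)))
print(split_2_1)
```

Only the data dictionary can tell you which reading is the right one.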

Example where fixed width formatting is needed

Helpful functions with read_fwf

  • fwf_empty()
    • Uses spacing to guess at column positions
  • fwf_widths()
    • Specifies column widths
  • fwf_positions()
    • Specifies start and end locations for each column
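fwf_widths() and fwf_positions() are two ways of describing the same layout. The sketch below applies both to the four-line fixed width example, supplied as literal text. It assumes readr 2.0 or later for the I() wrapper.

```r
library(readr)

# The four lines from the fixed width example, supplied as literal text.
fixed_text <- "1 4\n2 8\n312\n416\n"

# Widths of 1 and 2 characters ...
by_width <- fwf_widths(c(1, 2), col_names = c("x", "y"))

# ... describe the same layout as columns 1-1 and 2-3.
by_position <- fwf_positions(
  start = c(1, 2),
  end = c(1, 3),
  col_names = c("x", "y"))

d_width <- read_fwf(
  I(fixed_text), col_positions = by_width, col_types = "nn")
d_position <- read_fwf(
  I(fixed_text), col_positions = by_position, col_types = "nn")

print(d_width)

# Both descriptions should produce the same values.
print(identical(d_width$y, d_position$y))
```

Widths are convenient when the columns sit side by side with no gaps. Positions are easier to copy from a data dictionary that lists start and end columns.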

The read_fwf function

# Column x occupies 1 character and column y occupies the next 2.
simple_fixed <- read_fwf(
  file="../data/fixed.txt",
  col_types="nn",
  col_positions = fwf_widths(
    c(1, 2),
    col_names=c("x", "y")))

glimpse(simple_fixed)

Live demonstration, 4

Now, you will see a live demonstration of part 4 of simon-5505-03-demo.

Break #5

  • What you have learned
    • Fixed width files
  • What’s coming next
    • Real world examples

Function arguments for advanced options

  • col_select=
  • na=
  • name_repair=
  • skip=

Example 1, binary.csv

Example 1, a brief description

Example 1, viewing the file in Notepad

Example 1, the code to peek at the data

 [1] "admit,gre,gpa,rank" "0,380,3.61,3"       "1,660,3.67,3"      
 [4] "1,800,4,1"          "1,640,3.19,4"       "0,520,2.93,4"      
 [7] "1,760,3,2"          "1,560,2.98,1"       "0,400,3.08,2"      
[10] "1,540,3.39,3"      

Example 1, the code to read the data

example_1 <- read_csv(
  file=url_1,
  col_names=TRUE,
  col_types="nnnn")

glimpse(example_1)

Example 2, barbershop-music.txt

Example 2, viewing the file in Notepad

Example 2, peeking at the data

 [1] "Singing\tPerformance\tMusic" "151\t143\t138"              
 [3] "152\t146\t136"               "146\t143\t140"              
 [5] "146\t147\t142"               "145\t141\t134"              
 [7] "144\t139\t140"               "133\t138\t132"              
 [9] "129\t135\t128"               "134\t125\t132"

Example 2, the code to read the data

example_2 <- read_tsv(
  file=url_2,
  col_names=TRUE,
  col_types="nnn")

glimpse(example_2)

Example 3, airport.txt

Example 3, peeking at the file on the web

Example 3, a description of the data

  • Here is an excerpt from the data dictionary.

VARIABLE DESCRIPTIONS:
Airport                               Columns 1-21
City                                  Columns 22-43 
Scheduled departures                  Columns 44-49 
Performed departures                  Columns 51-56
Enplaned passengers                   Columns 58-65
Enplaned revenue tons of freight      Columns 67-75
Enplaned revenue tons of mail         Columns 77-85

Example 3, the code to peek at the data

url_3 <- "http://jse.amstat.org/datasets/airport.dat.txt"
read_lines(
  file=url_3,
  n_max=10)

Example 3, Defining variable names and column locations

start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column <-   c(21, 43, 49, 56, 65, 75, 85)
variable_names <- c(
  "airport",
  "city",
  "scheduled_departures",
  "performed_departures",
  "enplaned_passengers",
  "enplaned_freight",
  "enplaned_mail")

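Typing column locations by hand invites mistakes, so a quick consistency check before reading the file can help. This base R sketch repeats the three vectors defined above. The checks themselves are a suggestion, not part of the original demonstration.

```r
start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column <-   c(21, 43, 49, 56, 65, 75, 85)
variable_names <- c(
  "airport",
  "city",
  "scheduled_departures",
  "performed_departures",
  "enplaned_passengers",
  "enplaned_freight",
  "enplaned_mail")

# There should be one start, one end, and one name for each variable.
same_length <-
  length(start_column) == length(end_column) &&
  length(start_column) == length(variable_names)
print(same_length)

# The implied width of each column.
widths <- end_column - start_column + 1
print(widths)

# Each column should start after the previous column ends.
no_overlap <- all(start_column[-1] > end_column[-length(end_column)])
print(no_overlap)
```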
Example 3, the code to read the data

# Airport and city are character; the remaining five columns are numeric.
example_3 <- read_fwf(
  file=url_3,
  col_types="ccnnnnn",
  col_positions=fwf_positions(
    start=start_column,
    end=end_column,
    col_names=variable_names))

glimpse(example_3)

Break #6

  • What you have learned
    • Real world examples
  • What’s coming next
    • Your programming assignment

This programming assignment was written by Steve Simon on 2024-12-18 and is placed in the public domain.

Program

  • Create a single program to address the questions below.
    • Refer to the module 03 demonstration programs as needed.
    • Store your program in the src folder
    • Follow the naming conventions recommended for this class
    • Include the appropriate documentation

Question 1

The oyster dataset shows two different computer vision methods that can be used to estimate oyster weight and oyster volume. Please consult the data description and then review the dataset itself. This is a tab delimited file. Read in the file and show a glimpse of the data. No interpretation of the output is needed.

Question 2

The file diamond.txt is data from a study of diamond ring prices. Please consult the data description and then review the dataset itself. This is a fixed width text file. Read in the file and show a glimpse of the data. No interpretation of the output is needed.

Grading rubric

You will be evaluated using the general grading rubric for programming assignments.

Your submission

  • Save the output in HTML format
  • Convert it to PDF format
  • Make sure that the PDF filename includes
    • Your last name
    • The number of this course
    • The number of this module
  • Upload the file

If it doesn’t work

Please review the suggestions if you encounter an error page.

Summary

  • What you have learned
    • Reading text files
    • Comma delimited files
    • Tab delimited files
    • Other delimiters
    • Fixed width files
    • Real world examples
    • Your programming assignment